The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a considerable fraction of participants (32%) stated that they did not have enough time for method development, and 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based, and of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
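The most common workaround reported for oversized samples, patch-based training, can be illustrated with a minimal sketch. The 3D patch size, tensor layout, and random cropping strategy below are illustrative assumptions, not details reported by the survey respondents.

```python
import torch

def random_patch(volume: torch.Tensor, patch_size=(64, 64, 64)):
    """Crop a random 3D patch from a volume shaped (C, D, H, W).

    Training on fixed-size patches keeps memory bounded when whole
    samples (e.g., CT/MR volumes) are too large to fit on the GPU.
    """
    _, d, h, w = volume.shape
    pd, ph, pw = patch_size
    z = torch.randint(0, d - pd + 1, (1,)).item()
    y = torch.randint(0, h - ph + 1, (1,)).item()
    x = torch.randint(0, w - pw + 1, (1,)).item()
    return volume[:, z:z + pd, y:y + ph, x:x + pw]

# Example: a 1-channel 256^3 volume reduced to a 64^3 training patch.
volume = torch.randn(1, 256, 256, 256)
patch = random_patch(volume)
print(patch.shape)  # torch.Size([1, 64, 64, 64])
```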
This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes at https://talkshow.is.tue.mpg.de.
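The compositional VQ-VAE at the heart of the body/hand branch relies on vector quantization: continuous motion features are snapped to the nearest entries of a learned codebook, which is what enables diverse, discrete sampling of motions. The sketch below shows only this generic quantization step; the codebook size, feature dimension, and straight-through gradient trick are standard VQ-VAE choices assumed for illustration, not details taken from TalkSHOW.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""

    def __init__(self, num_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])      # (batch*time, dim)
        # Squared distances to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # discrete motion tokens
        z_q = self.codebook(idx).view_as(z)
        # Straight-through estimator: copy gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx.view(z.shape[:-1])

quantizer = VectorQuantizer()
codes, tokens = quantizer(torch.randn(2, 30, 64))  # 30 motion frames
print(codes.shape, tokens.shape)
```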
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a "scanner" of the 3D world. Intuitively, human movement indicates the free space in a room, and human contact indicates surfaces or objects that support activities such as sitting, lying, or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), a generative model of indoor scenes that produces furniture layouts consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.
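The auto-regressive generation described above, where each new object is conditioned on the human motion and on the objects already placed, can be sketched as a simple loop. The feature dimensions, object attribute vector, and stopping rule below are illustrative assumptions rather than the actual MIME interface.

```python
import torch
import torch.nn as nn

class SceneGenerator(nn.Module):
    """Toy auto-regressive stand-in: motion tokens + placed objects -> next object."""

    def __init__(self, dim=128, obj_dim=10, max_objects=8):
        super().__init__()
        self.obj_embed = nn.Linear(obj_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, obj_dim + 1)       # object attributes + stop logit
        self.max_objects = max_objects

    @torch.no_grad()
    def generate(self, motion_tokens):                # (1, T, dim) motion features
        objects = []                                  # each: (1, obj_dim)
        for _ in range(self.max_objects):
            obj_tokens = [self.obj_embed(o).unsqueeze(1) for o in objects]
            seq = torch.cat([motion_tokens] + obj_tokens, dim=1)
            out = self.head(self.encoder(seq)[:, -1])  # last token summarises context
            attrs, stop_logit = out[:, :-1], out[:, -1]
            if torch.sigmoid(stop_logit) > 0.5:        # model decides the room is full
                break
            objects.append(attrs)
        return objects

gen = SceneGenerator()
scene = gen.generate(torch.randn(1, 60, 128))          # 60 frames of motion features
print(len(scene), "objects generated")
```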
Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets so that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer full-body clothing from a single image, we leverage two priors in ELICIT: a 3D geometry prior and a visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with CLIP-based pre-trained models. Both priors are used to jointly guide the optimization toward creating plausible content in the invisible areas. To further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT outperforms current state-of-the-art avatar creation methods when only a single image is available. Code will be made public for research purposes at https://elicit3d.github.io.
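One way to read the visual semantic prior is as a CLIP similarity term that keeps renderings of unseen body regions consistent with a description of the visible clothing. The sketch below uses the public openai/CLIP package to compute such a loss between a rendered patch and a text prompt; the prompt and the way the loss enters the optimization are illustrative assumptions, not the exact ELICIT formulation.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# CLIP's expected input normalisation.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_semantic_loss(rendered, prompt):
    """Penalise renderings whose CLIP embedding drifts from a clothing description.

    `rendered` is an RGB tensor in [0, 1] of shape (1, 3, H, W), e.g. a patch
    rendered from the radiance field at a novel viewpoint.
    """
    image = F.interpolate(rendered, size=224, mode="bilinear", align_corners=False)
    image = (image - MEAN) / STD
    img_feat = F.normalize(model.encode_image(image), dim=-1)
    txt_feat = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
    return 1.0 - (img_feat * txt_feat).sum()        # cosine distance

loss = clip_semantic_loss(torch.rand(1, 3, 256, 256, device=device),
                          "a person wearing a blue t-shirt and jeans")
print(loss.item())
```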
Neural radiance fields (NeRF) have been successfully used for scene representation, and recent work has also built robot navigation and manipulation systems on top of NeRF-based environment representations. Since object localization is fundamental to many robotic applications, we study object localization in NeRF scenes to further unlock the potential of NeRF for robotic systems. We propose NeRF-Loc, a transformer-based framework that extracts the 3D bounding boxes of objects in NeRF scenes. NeRF-Loc takes a pre-trained NeRF model and camera views as input, and produces labeled 3D bounding boxes of objects as output. Specifically, we design a pair of parallel transformer encoder branches, namely a coarse stream and a fine stream, to encode both the context and the details of the target object. The encoded features are then fused with attention layers to alleviate ambiguity for accurate object localization. We compare our method with conventional transformer-based methods, and ours achieves better performance. In addition, we present the first NeRF-sample-based object localization benchmark, NeRFLocBench.
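The coarse/fine two-branch design can be pictured as two parallel transformer encoders whose outputs are fused by cross-attention before a box-regression head. The feature dimensions, token counts, and 7-parameter box output (center, size, yaw) below are illustrative assumptions, not the actual NeRF-Loc implementation.

```python
import torch
import torch.nn as nn

class TwoStreamLocalizer(nn.Module):
    """Toy coarse/fine encoders with attention fusion and a 3D-box head."""

    def __init__(self, dim=128):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.coarse = nn.TransformerEncoder(layer(), num_layers=2)  # scene context
        self.fine = nn.TransformerEncoder(layer(), num_layers=2)    # object detail
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.box_head = nn.Linear(dim, 7)   # (cx, cy, cz, w, h, d, yaw)

    def forward(self, coarse_tokens, fine_tokens):
        c = self.coarse(coarse_tokens)
        f = self.fine(fine_tokens)
        fused, _ = self.fuse(query=f, key=c, value=c)  # detail queries attend to context
        return self.box_head(fused.mean(dim=1))        # one box per sample

model = TwoStreamLocalizer()
boxes = model(torch.randn(2, 64, 128), torch.randn(2, 256, 128))
print(boxes.shape)  # torch.Size([2, 7])
```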
Degradation representations are widely used in various learning-based image restoration tasks, such as image denoising and image super-resolution, to model the degradation process and handle complex degradation patterns. However, they are less explored in learning-based image deblurring, because blur kernel estimation does not perform well in challenging real-world scenarios. We argue that degradation representations are particularly necessary for image deblurring, since blur patterns typically exhibit much larger variations than noise patterns or high-frequency textures. In this paper, we propose a framework to learn spatially adaptive degradation representations of blurry images. A novel joint image reblurring and deblurring learning process is presented to improve the expressiveness of the degradation representations. To make the learned degradation representations effective for reblurring and deblurring, we propose a Multi-Scale Degradation Injection Network (MSDI-Net) to integrate them into the neural network. With this integration, MSDI-Net can adapt to various complex blur patterns. Experiments on the GoPro and RealBlur datasets show that our proposed deblurring framework with learned degradation representations outperforms state-of-the-art methods with appealing improvements. The code is released at https://github.com/dasongli1/learning_degradation.
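The degradation-injection idea can be illustrated with a feature-modulation block: the learned degradation representation predicts a spatially varying scale and shift that are applied to intermediate deblurring features, letting the network adapt to the local blur pattern. The channel sizes and affine modulation below are illustrative assumptions loosely following spatial feature transform layers, not the exact MSDI-Net design.

```python
import torch
import torch.nn as nn

class DegradationInjection(nn.Module):
    """Modulate image features with a spatially adaptive degradation map."""

    def __init__(self, feat_ch=64, deg_ch=16):
        super().__init__()
        # Predict an affine modulation (scale, shift) from the degradation code.
        self.to_scale = nn.Conv2d(deg_ch, feat_ch, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(deg_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat, degradation):
        # feat:        (B, feat_ch, H, W) intermediate deblurring features
        # degradation: (B, deg_ch, H, W) spatially adaptive degradation representation
        scale = self.to_scale(degradation)
        shift = self.to_shift(degradation)
        return feat * (1 + scale) + shift

inject = DegradationInjection()
out = inject(torch.randn(1, 64, 128, 128), torch.randn(1, 16, 128, 128))
print(out.shape)  # torch.Size([1, 64, 128, 128])
```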
Video restoration, which aims to restore clear frames from degraded videos, has been attracting increasing attention. Video restoration requires establishing temporal correspondences across multiple misaligned frames. To achieve this, existing deep methods typically adopt complicated network architectures, such as integrating optical flow, deformable convolutions, or cross-frame and cross-pixel self-attention layers, which results in expensive computational cost. We argue that, with proper design, temporal information in video restoration can be exploited much more efficiently. In this study, we propose a simple, fast, yet effective framework for video restoration. The key to our framework is the grouped spatio-temporal shift, which is simple and lightweight, yet can implicitly establish inter-frame correspondences and enable multi-frame aggregation. Coupled with a basic 2D U-Net for frame-wise encoding and decoding, this efficient spatio-temporal shift module can effectively tackle the challenges of video restoration. Extensive experiments show that our framework surpasses previous state-of-the-art methods with only 43% of their computational cost on both video deblurring and video denoising.
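The grouped spatio-temporal shift itself is simple enough to write out: channel groups of each frame's features are shifted forward or backward along the time axis, so a frame-wise 2D network sees information from neighbouring frames without optical flow or attention. The group count and purely temporal shift below are an illustrative sketch in the spirit of the paper, not its exact configuration.

```python
import torch

def grouped_temporal_shift(feat, shifts=(-1, 0, 1)):
    """Shift channel groups of a video feature along time.

    feat: (B, T, C, H, W). Channels are split into len(shifts) groups;
    group i is rolled by shifts[i] frames, zeroing frames that roll past
    the sequence boundary, so each frame mixes in neighbouring-frame
    features before a 2D U-Net processes it.
    """
    groups = feat.chunk(len(shifts), dim=2)
    out = []
    for g, s in zip(groups, shifts):
        if s == 0:
            out.append(g)
            continue
        rolled = torch.roll(g, shifts=s, dims=1)
        if s > 0:
            rolled[:, :s] = 0     # frames shifted in from before the clip
        else:
            rolled[:, s:] = 0     # frames shifted in from after the clip
        out.append(rolled)
    return torch.cat(out, dim=2)

video_feat = torch.randn(1, 8, 48, 64, 64)     # 8 frames, 48 channels
shifted = grouped_temporal_shift(video_feat)
print(shifted.shape)  # torch.Size([1, 8, 48, 64, 64])
```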
Inferring human-scene contact (HSC) is the first step toward understanding how humans interact with their surroundings. While detecting 2D human-object interaction (HOI) and reconstructing 3D human pose and shape (HPS) have made significant progress, reasoning about 3D human-scene contact from a single image remains challenging. Existing HSC detection methods consider only a few predefined types of contact, often reduce the body and scene to a small number of primitives, and even overlook image evidence. To predict human-scene contact from a single image, we address the above limitations from both the data and algorithm perspectives. We capture a new dataset called RICH, for "Real scenes, Interaction, Contact and Humans". RICH contains multi-view outdoor/indoor video sequences at 4K resolution, ground-truth 3D human bodies captured with markerless motion capture, 3D body scans, and high-resolution 3D scene scans. A key feature of RICH is that it also contains accurate vertex-level contact labels on the body. Using RICH, we train a network that predicts dense body-scene contact from a single RGB image. Our key insight is that regions in contact are always occluded, so the network needs the ability to explore the whole image for evidence. We use a transformer to learn such non-local relationships and propose a new Body-Scene contact TRansfOrmer (BSTRO). Very few methods explore 3D contact; those that do focus only on the feet, estimate foot contact as a post-processing step, or infer contact from body pose without looking at the scene. To our knowledge, BSTRO is the first method to directly estimate 3D body-scene contact from a single image. We demonstrate that BSTRO significantly outperforms the prior art. The code and dataset are available at https://rich.is.tue.mpg.de.
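The output described above, a dense contact probability for every body vertex predicted from a single image, can be sketched as a transformer over image tokens followed by a per-vertex classification head. The backbone features, pooling, and SMPL-sized 6890-vertex output below are illustrative assumptions rather than the actual BSTRO architecture.

```python
import torch
import torch.nn as nn

NUM_VERTICES = 6890   # SMPL template size, used here purely for illustration

class ContactTransformer(nn.Module):
    """Toy stand-in: image tokens -> per-vertex contact probabilities."""

    def __init__(self, dim=256, num_layers=2):
        super().__init__()
        # Self-attention over the whole image lets the model gather evidence
        # for contact regions that are themselves occluded by the body.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=num_layers)
        self.head = nn.Linear(dim, NUM_VERTICES)

    def forward(self, image_tokens):              # (B, num_tokens, dim) backbone features
        x = self.encoder(image_tokens)
        pooled = x.mean(dim=1)                    # summarise the whole image
        return torch.sigmoid(self.head(pooled))   # (B, NUM_VERTICES) contact probabilities

model = ContactTransformer()
contact = model(torch.randn(1, 196, 256))          # e.g. a 14x14 grid of image features
print(contact.shape)  # torch.Size([1, 6890])
```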
The lack of large-scale noisy-clean image pairs restricts the deployment of supervised denoising methods in real applications. While existing unsupervised methods are able to learn image denoising without ground-truth clean images, they either show poor performance or work under impractical settings (e.g., paired noisy images). In this paper, we present a practical unsupervised image denoising method that achieves state-of-the-art denoising performance. Our method only requires single noisy images and a noise model, which is easily accessible in practical raw image denoising. It iteratively performs two steps: (1) constructing a noisier-noisy dataset by adding random noise from the noise model to the noisy images; (2) training a model on the noisier-noisy dataset and using the trained model to refine the noisy images, which serve as the targets for the next round. We further approximate our full iterative method with a fast algorithm for more efficient training while maintaining its original high performance. Experiments on real-world, synthetic, and correlated noise show that our proposed unsupervised denoising method is superior to existing unsupervised methods and competitive with supervised methods. In addition, we argue that existing denoising datasets are of low quality and contain only a small number of scenes. To evaluate raw image denoising performance in real-world applications, we build a high-quality raw image dataset, SenseNoise-500, containing 500 real-life scenes. The dataset can serve as a strong benchmark for better evaluating raw image denoising. Code and dataset will be released at https://github.com/zhangyi-3/idr.
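The two-step loop in the abstract, building a noisier-noisy dataset, training on it, and then refining the targets for the next round, maps directly onto a short training sketch. The tiny denoiser, Gaussian noise model, and number of rounds below are illustrative stand-ins; the paper's raw-noise model and fast approximation are not reproduced here.

```python
import torch
import torch.nn as nn

def noise_model(x, sigma=0.1):
    """Stand-in noise model: additive Gaussian noise (the real method can use
    any sampler matching the camera's raw noise)."""
    return x + sigma * torch.randn_like(x)

denoiser = nn.Sequential(                 # tiny stand-in denoiser
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1))
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

targets = torch.rand(16, 1, 64, 64)       # round 0 targets: the noisy images themselves

for round_idx in range(3):                # iterative data refinement
    for _ in range(100):                  # steps (1) + (2): train on noisier -> current targets
        noisier = noise_model(targets)    # add synthetic noise on top of current targets
        loss = nn.functional.mse_loss(denoiser(noisier), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                 # refine the targets for the next round
        targets = denoiser(targets)
    print(f"round {round_idx}: loss {loss.item():.4f}")
```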
The lack of large-scale real raw image denoising datasets makes it challenging to train models on synthetic raw image noise. Real raw image noise, however, is contributed by many noise sources and varies greatly among different sensors. Existing methods cannot model all noise sources accurately, and building a separate noise model for every sensor is laborious. In this paper, we introduce a new perspective: synthesizing noise by directly sampling from the sensor's real noise. This inherently generates accurate raw image noise for different camera sensors. Two efficient and generic techniques, pattern-aligned patch sampling and high-bit reconstruction, accurately synthesize spatially correlated noise and high-bit noise, respectively. We conduct systematic experiments on the SIDD and ELD datasets. The results show that (1) our method outperforms existing methods and generalizes widely across different sensors and lighting conditions, and (2) recent conclusions drawn from DNN-based noise modeling methods are actually based on inaccurate noise parameters; DNN-based methods still cannot outperform physics-based statistical methods.
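Pattern-aligned patch sampling can be illustrated in a few lines: noise patches are cropped from real noise captures only at offsets that are multiples of the Bayer period, so the spatially correlated noise stays aligned with the colour filter array when it is pasted onto clean images. The 2x2 Bayer period, dark-frame stand-in, and additive pasting below are illustrative assumptions, not the paper's full pipeline (high-bit reconstruction is omitted).

```python
import torch

def sample_aligned_noise(noise_frames, patch_size=64, period=2):
    """Crop a noise patch whose top-left corner lies on the Bayer grid.

    noise_frames: (N, H, W) real sensor noise (e.g., dark frames minus their mean);
    cropping at offsets that are multiples of `period` preserves the per-CFA-site
    noise statistics.
    """
    n, h, w = noise_frames.shape
    idx = torch.randint(0, n, (1,)).item()
    y = torch.randint(0, (h - patch_size) // period + 1, (1,)).item() * period
    x = torch.randint(0, (w - patch_size) // period + 1, (1,)).item() * period
    return noise_frames[idx, y:y + patch_size, x:x + patch_size]

# Toy example: add a sampled real-noise patch onto a clean raw patch.
dark_frames = 0.01 * torch.randn(10, 512, 512)      # stand-in for captured noise frames
clean_patch = torch.rand(64, 64)
noisy_patch = clean_patch + sample_aligned_noise(dark_frames)
print(noisy_patch.shape)  # torch.Size([64, 64])
```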